Mountain Bike (MTB) Categorization Analysis

Introduction

Project Overview

For this project, our team will determine whether the specifications of mountain bikes (MTB) are enough to differentiate between the different types of mountain bike categories.

Currently, full suspension mountain bikes come in multiple categories:

  • Cross Country (XC) | Tend to be the most lightweight, nimble, and designed to put the rider in an efficient pedaling position
  • Enduro (EN) | Heavier frames, more travel and more downhill oriented geometry
  • Trail (TR) | The most common category of bikes, considered to be the halfway point between XC and Enduro
  • All Mountain (AM) | A more niche category which some manufacturers claim to be more downhill focused than trail bikes, but not designed for downhill races like Enduro bikes are
  • Downcountry (DC) | A relatively new category between XC and Trail. Similar to the All Mountain category, these bikes aren’t race specific like XC bikes tend to be, but are lighter and faster than trail bikes.

With all of the factors to consider when designing a bike, there are no clear boundaries between these categories. For example, one brand’s Downcountry bike could be what another brand considers a Trail bike.

The goal of our project is to determine how many, if any, discrete categories should exist for mountain bikes. Since most specifications and geometric measurements have one direction when moving across the spectrum of bikes, it’s reasonable to believe that these measurements could be reduced to much fewer dimensions, and perhaps even one continuous principle component rather than discrete categories. Here is a diagram of some of the different types of geometric specifications on mountain bikes:

Various Dimension Features of a Bike’s Geometry

The Data

The data was retrieved manually from each of the mountain bike company’s websites. Let’s take a look at the data.

# Read in sheet 2 of our data
mtb_data <- read_excel(here::here('Data/mtb_stats.xlsx'), 'Sheet1')
mtb_data <- mtb_data %>% 
  # Clean up the label column
  mutate(label = str_replace_all(str_to_lower(label), '[:punct:]', ''),
         # Create a feature for the long-version of the names
         bike_category = case_when(
          label == 'tr' ~ 'Trail',
          label == 'xc' ~ 'Cross Country',
          label == 'dc' ~ 'Downcountry',
          label == 'am' ~ 'All Mountain',
          label == 'en' ~ 'Enduro',
          TRUE ~ 'red'
        ))

# Pull out the class labels
labels <- mtb_data %>% 
  select(label)


# Let's view the mtb_data output
# In any kable outputs, display NAs as blanks
opts <- options(knitr.kable.NA = "")

mtb_data %>% 
  head(25) %>%
  # Fix up the headers by replacing the underscores with spaces
  rename_all(funs(str_replace_all(., "_", " "))) %>% 
  # Make everything proper capitalization
  # rename_all(funs(str_to_title)) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = F,
                font_size = 12) %>%
  # Make the header row bold and black so it's easier to read
  row_spec(0, bold = T, color = "black") %>% 
  scroll_box(height = "400px", width = "100%")
model brand build type price url image setting size used label rear travel fork travel f piston f rotor dim r piston r rotor dim head angle seat angle crank length stem length handlebar width reach stack wheelbase chainstay length bb height standover height bike category
habit cannondale L tr 130 140 4 180 4 180 66.0 74.5 780 460.0 625.0 1210.0 435.0 339.0 770.0 Trail
scalpel cannondale L xc 100 100 2 160 2 160 68.0 74.5 80 760 435.0 601.0 1175.0 436.0 331.0 745.0 Cross Country
scalpel se cannondale L dc 120 120 2 160 2 160 67.0 74.0 780 450.0 611.0 1172.0 436.0 344.0 758.0 Downcountry
reign advanced pro giant L en 146 170 4 203 4 203 64.6 76.4 40 800 488.0 631.0 1262.0 439.0 781.0 Enduro
trance advanced X pro giant high L tr 135 150 4 203 4 180 66.2 77.9 50 800 494.0 624.0 1238.0 435.0 761.0 Trail
trance advanced X pro giant low L tr 135 150 4 203 4 180 65.5 77.2 50 800 486.0 631.0 1239.0 438.0 752.0 Trail
anthem advanced pro giant L xc 90 100 2 180 2 160 69.0 73.5 80 780 454.0 594.0 1154.0 438.0 817.0 Cross Country
jet 9 rdo niner high M tr 120 130 4 180 4 180 66.5 76.0 40 800 450.0 613.0 1179.0 430.0 698.0 Trail
jet 9 rdo niner low M tr 120 130 4 180 4 180 66.0 75.5 40 800 444.0 617.0 1180.0 432.0 705.0 Trail
rip 0 rdo niner high M tr 140 150 4 180 4 180 66.0 75.8 800 440.0 615.0 1181.0 435.0 712.0 Trail
rip 0 rdo niner low M tr 140 150 4 180 4 180 65.0 75.2 800 433.0 619.0 1182.0 435.0 705.0 Trail
rkt 9 rdo niner M dc 90 120 4 180 4 160 70.0 73.5 780 413.0 617.0 1111.0 439.0 739.0 Downcountry
rkt 9 rdo rs niner M xc 90 100 4 180 4 160 71.0 74.5 780 424.0 600.0 1103.0 439.0 728.0 Cross Country
megatower santa cruz L en 160 160 4 200 4 200 65.0 76.6 470.0 625.0 1231.0 435.0 343.0 713.0 Enduro
tallboy santa cruz L tr 120 130 4 180 4 180 65.7 76.4 50 800 470.0 619.0 1211.0 430.0 335.0 706.0 Trail
hightower santa cruz L tr 145 150 4 180 4 180 65.5 76.8 50 780 473.0 619.0 1231.0 433.0 344.0 717.0 Trail
blur santa cruz L xc 100 100 2 160 2 160 69.0 74.0 750 460.0 598.0 1160.0 432.0 328.0 723.0 Cross Country
blur tr santa cruz L dc 115 120 2 180 2 180 67.1 74.9 175 60 760 457.5 606.5 1183.2 435.8 339.6 745.4 Downcountry
ransom scott L/29 en 170 170 4 203 4 180 64.5 75.0 50 800 466.5 627.6 1249.2 437.9 353.0 760.9 Enduro
spark scott L tr 120 130 4 180 4 180 67.2 73.8 70 760 460.0 602.4 1182.8 438.0 327.0 778.0 Trail
genius scott high L tr 150 150 4 203 4 180 65.6 75.3 50 780 472.0 609.2 1230.8 436.0 340.0 749.5 Trail
genius scott low L tr 150 150 4 203 4 180 65.0 74.8 50 780 466.1 613.7 1232.1 438.0 345.9 758.4 Trail
spark rc scott L xc 100 110 2 180 2 160 68.5 73.8 80 740 456.8 596.2 1158.6 435.0 319.5 756.0 Cross Country
epic evo specialized high m dc 110 120 4 180 4 160 67.0 74.5 175 60 760 436.0 597.0 1164.0 438.0 339.0 781.0 Downcountry
epic evo specialized low m dc 110 120 4 180 4 160 66.5 74.5 175 60 760 436.0 597.0 1164.0 438.0 336.0 781.0 Downcountry

EDA

In this section, we’ll take a look at the 72 mountain bikes in our dataset and some of the 27 features. We’ll try to break down our understanding of the data in terms of label, our target variable that acts as the category for each mountain bike.

Label (Mountainbike Category)

As stated earlier, there are 5 mountain bike categories in our dataset:

  1. Cross Country (xc)
  2. Enduro (en)
  3. Trail (tr)
  4. All Mountain (am)
  5. Downcountry (dc)

Let’s look at how many of each we have in our dataset.

mtb_data %>% 
  group_by(bike_category) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  # Start our visualization, creating our groups by party affiliation
  ggplot(aes(y = forcats::fct_reorder(bike_category, n), x = n)) +
  geom_col(fill = "slateblue", na.rm = T) +
  # Add a label by recreating our data build from earlier
  geom_label(aes(label = n),
             size = 5,
             # Scooch the labels over a smidge
             hjust = .25) +
  # Let's change the names of the axes and title
  xlab("Number of Bikes") +
  ylab("Category (label)") +
  labs(title = "Number of Mountain Bikes per Category")

We see that out of our 72 bikes, most of them are Trail bikes, with the smallest grouping of bikes being all mountain bikes

Categorical Variables

There are 4 categorical variables we’ll take a look at to better understand our data:

  1. Setting
  2. Size
  3. Front Piston (f_piston)
  4. Rear Piston (r_piston)
mtb_data %>% 
  select(-label, -bike_category) %>% 
  DataExplorer::plot_bar(ggtheme = theme_classic(),
                         title = 'Distribution of Categorical Variables',
                         theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                            color = "slateblue4", 
                                                                            size = 24),
                                                  plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                  plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                  axis.title.x = element_text(size = 14),
                                                  axis.title.y = element_text(size = 14)),
                         maxcat = 15,
                         ncol = 2)

  • We see that most of the bikes don’t have a setting, which is used to help differentiate between different versions of the same bike. Later on, we’ll group by settings for the same bike and average the results to get a more accurate representation of the bikes’ specs.
  • Most of the bikes analyzed have 4 rear/front pistons. The two variables seem to be perfectly in-sync, leading us to believe that they’re highly correlated.

But, really, we care about understanding how these different variables interact with our target variable, label. Let’s look at their distribution and look for any patterns.

mtb_data %>% 
  DataExplorer::plot_bar(ggtheme = theme_classic(),
                         by = 'label',
                         by_position = 'fill',
                         title = 'Distribution of Categorical Variables',
                         theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                            color = "slateblue4", 
                                                                            size = 24),
                                                  plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                  plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                  axis.title.x = element_text(size = 14),
                                                  axis.title.y = element_text(size = 14)),
                         maxcat = 15,
                         ncol = 2)

Here we see:

  • The size used for most of the bikes is pretty evenly distributed. For the most part, we attempted to find bikes that are sized to the heights of the authors of this report (approx. 5’8”-5’11”), which tended to be Large-sized bikes; however, for some bikes, like Trail, the specific bike’s company website from which we pulled the data recommended a Medium-sized bike.
  • Although most of the bikes are 4-piston bikes, of the bikes that have 2 pistons, most are Cross Country (xc) bikes.

Continuous Variables

To analyze the continuous features within our dataset, we built density plots for each of them to better understand their distribution.

DataExplorer::plot_density(mtb_data,
                             ggtheme = theme_classic(),
                             title = 'Distribution of Continuous Variables',
                             geom_density_args = list(fill = 'slateblue'),
                             theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                                color = "slateblue4", 
                                                                                size = 24),
                                                      plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                      plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                      axis.title.x = element_text(size = 14),
                                                      axis.title.y = element_text(size = 14)),
                             ncol = 3)

~Normally Distributed Variables:

  • Chainstay_length
  • Fork_travel
  • Bb_height
  • Seat_angle

Skewed Variables:

  • Head_angle (skewed right)
  • Handlebar_width (skewed left)
  • Wheelbase (skewed left)

Multi-Modal Distributed Variables:

  • f_rotor_dim / r_rotor_dim
  • Stem_length

Like we did for continuous variables, let’s look at the distribution of each of these predictors by our target variable, label, to look for any discernible patterns.

mtb_data %>% 
  DataExplorer::plot_boxplot(by = 'label',
                             geom_boxplot_args = list('fill' = 'slateblue'),
                           ggtheme = theme_classic(),
                           theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                          color = "slateblue4", 
                                                                          size = 24),
                                                plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                axis.title.x = element_text(size = 14),
                                                axis.title.y = element_text(size = 14)),
                           ncol = 3)

Here we see:

  • Cross Country (xc) bikes tend to have the largest head angle and smallest seat angle compared to other bikes. They also have the largest stem length by a significant margin. Overall, Cross Country bikes tend to be the most differentiable from other bike categories;
  • All Mountain (am) bikes have a significantly smaller standover height and, along with Enduro (en) bikes, have a much larger reach than other bike categories;
  • As is generally expected, Trail (tr) bikes tend to fit mostly in the middle for most of these continuous’ variables. This makes sense given that they tend to split the difference between Cross Country and Enduro bikes.

Average bikes by flip-chip setting

Because some bikes’ websites would have two different “settings” for the same-sized bike, we opted to include both options and average the two together to get one middle-of-the-road estimate for that type of bike.

# Split data based on setting vs. no setting
no_setting <- mtb_data %>% 
  filter(is.na(setting))
setting <- mtb_data %>% 
  filter(!is.na(setting))



setting <- cbind(setting$model, setting$label, select_if(setting, is.numeric))
setting$model <- setting$`setting$model`
setting <- setting %>% select(-`setting$model`)
setting$label <- setting$`setting$label`
setting <- setting %>% select(-`setting$label`)

mean_by_setting <- aggregate(x=select(setting, -c(model, label)),
                             by=list(setting$model, setting$label),
                             FUN=mean)
mean_by_setting$model <- mean_by_setting$Group.1
mean_by_setting$label <- mean_by_setting$Group.2
mean_by_setting <- mean_by_setting %>% select(-c(Group.1, Group.2))

no_setting <- cbind(no_setting$model, no_setting$label, select_if(no_setting, is.numeric))
no_setting$model <- no_setting$`no_setting$model`
no_setting <- no_setting %>% select(-`no_setting$model`)
no_setting$label <- no_setting$`no_setting$label`
no_setting <- no_setting %>% select(-`no_setting$label`)

new_mtb_data <- data.frame(rbind(mean_by_setting, no_setting))

rownames(new_mtb_data) <- new_mtb_data$model

Because some bikes’ websites would have two different “settings” for the same-sized bike, we opted to include both options and average the two together to get one middle-of-the-road estimate for that type of bike. We end up performing this operation for 43% of the bikes in our dataset.


Methodology

Now that we have a better understanding of our mountain bike dataset, we’ll formulate a plan to prove the following hypothesis:

Applying our own clustering algorithms will either give us a different set number of clusters (rather than the 5 pre-ordained categories) OR will not provide clearly defined clusters, leading us to believe that the bikes are actually created on a spectrum and cannot be grouped into one of the 5 pre-ordained categories.

To do so, we’ll:
- Try to use various methods to reduce the featureset and see if there are certain variables that can better be used to differentiate between different mountain bike categories. - Apply various clustering and classification algorithms, including K-Means Clustering, Gaussian Mixture Models, and Multi-class Support Vector Machine, to disprove the notion that 5 distinct categories of Mountain Bikes exist.


Variation Amongst Featureset

The first thing we’ll do is look to see if any of the features in our dataset are better at explaining the variation amongst the different bikes than other features. That is, it’s completely possible that two features are similar and don’t have much variation in them, even across some of the different bike categories. To do so, we’ll:

  1. Look for highly correlated features and flag these for potential removal;
  2. Run Principal Component Analysis (PCA) to see if certain features are better at explaining the variation in our data better than others.

1. Correlation

First, let’s take a look at our most highly correlated features. We’ll use the corrplot() function to better order the highly correlated features by the angular order of their eigenvectors.

mtb_correlation <- mtb_data %>% 
  # Get rid of price for now
  select(-price) %>% 
  # Select our variables of interest
  select_if(is.numeric) %>% 
  # Remove rows with NAs in them
  # drop_na() %>% 
  # Build our correlation matrix, such that missing values are handled by casewise deletion
  cor(use = 'complete.obs') 

# Convert our results into a tibble for easier manipulation
mtb_correlation_df <- mtb_correlation %>% 
  as_tibble() %>% 
  mutate(variable = colnames(mtb_correlation)) %>% 
  relocate(variable, everything())

# Build our correlation plot, using the angular order of the eigenvectors
corrplot(mtb_correlation,
         diag = F,
         col = COL2('PRGn'),
         tl.col = 'slateblue4',
         type = 'lower',
         method = 'color',
         order = 'AOE',
         title = 'Mountain Bike Feature Correlation'
         )

Here we see some obvious correlations, for example:

  • f_piston (front brakes) is perfectly correlated with r_piston (rear brakes), which makes sense since mountain bikes tend to use the same types/spec of brakes for the front vs. rear tires.
  • fork_travel has a correlation above .95 with: fork_travel. This make sense; for example, rear_travel should be highly correlated with fork_travel.

In all, here are the most highly correlated variables (i.e. variables which have a correlation above .9 or below -.9):

mtb_correlation_df %>% 
  pivot_longer(-variable, 
               names_to = 'correlated_variable', 
               values_to = 'correlation') %>% 
  filter(variable != correlated_variable) %>% 
  # Sort by the absolute value of correlation
  arrange(desc(abs(correlation))) %>% 
  filter((correlation > .90) | (correlation < -.90)) %>% 
  # Get rid of duplicative rows
  dplyr::distinct(correlation, .keep_all = T) %>% 
  pander()
variable correlated_variable correlation
f_piston r_piston 1
stack wheelbase 0.9512
seat_angle stem_length -0.9506
f_rotor_dim wheelbase 0.9465
rear_travel wheelbase 0.9458
fork_travel wheelbase 0.9371
fork_travel head_angle -0.9364
rear_travel head_angle -0.934
rear_travel fork_travel 0.9304
stem_length handlebar_width -0.9247
rear_travel f_rotor_dim 0.9243
fork_travel f_rotor_dim 0.9241
head_angle wheelbase -0.9115
f_rotor_dim head_angle -0.9046
reach stack 0.9039
seat_angle handlebar_width 0.9018

There are a lot, especially given that we only have 18 continuous columns in our dataset! For now, we’ll opt to include everything. But later on, as we analyze the importance of different features, we’ll look to remove some of the above variables first.

2. Principal Component Analysis (PCA)

Next, we’ll apply PCA to our dataset. In so doing, we’ll have to center and scale our data given how different the ranges are for certain measurements. Let’s take a look at our 5 principal components which explain the largest proportion of variance in the data:

# Impute missing values with column mean (not really best practice, but good enough)
for (c in 1:ncol(new_mtb_data)){
  if (is.numeric(unlist(new_mtb_data[,c]))){
    # print(colnames(new_mtb_data)[c])
    new_mtb_data[is.na(new_mtb_data[,c]), c] <- mean(unlist(new_mtb_data[,c]), na.rm=TRUE)  
  }
}

mtb_no_null <- new_mtb_data %>% 
                select(-price) %>%
                select_if(is.numeric) %>% 
                bind_cols(label = new_mtb_data$label) %>%
                drop_na()

mtb_pca <- prcomp(mtb_no_null %>% select(-label),
                  center = TRUE,
                  scale. = TRUE)

# Put our summary results into a dataframe
mtb_pca_df <- tibble(variable = c('Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion')) %>% 
  bind_cols(summary(mtb_pca)$importance)


mtb_pca_df %>% 
  # Only display the first 6 columns
  select(c(variable:PC5)) %>% 
  pander()
variable PC1 PC2 PC3 PC4 PC5
Standard Deviation 2.972 1.301 1.211 1.089 0.8863
Proportion of Variance 0.5194 0.09957 0.0862 0.06972 0.0462
Cumulative Proportion 0.5194 0.619 0.7052 0.7749 0.8211
mtb_pca_df %>% 
  # Pivot our data so it's easier to visualize
  pivot_longer(-variable, 
               names_to = 'PC',
               names_prefix = 'PC') %>% 
  # Make the principal component column an integer so ggplot orders it from 1:17 properly
  mutate(PC = as.integer(PC),
         # Convert value to % (multiply by 100) so it's not a decimal
         value = 100*value) %>% 
  filter(variable == 'Proportion of Variance') %>% 
  ggplot(aes(x = PC, y = value)) +
  geom_point(size = 4, color = 'slateblue') +
  geom_line(alpha = .6, lwd = 2, color = 'slateblue') + 
  labs(title = 'Proportion of Variance Explained by Principal Components',
       x = 'Principal Component',
       y = 'Proportion of Variance (%)')

We can see that, actually, that our 1st principal component alone explains more than half our data. Starting at the 2nd principal component, there’s a distinguishable elbow point. After that, we have a huge drop-off. Starting at our 5th principal component, nearly 82.1% of the data’s variation is properly explained. This leads us to believe that the majority of the variation in our data can be explained by using just 1 principal component!

Let’s take a look at how our top 2 principal components explain the 5 different mountain bike categories:

p_load(devtools,
       ggbiplot)

ggbiplot(mtb_pca,
              obs.scale = 1,
              var.scale = 1,
              groups = mtb_no_null$label,
              ellipse = TRUE,
              circle = FALSE,
              ellipse.prob = .5) + 
  theme(legend.direction = 'horizontal',
               legend.position = 'top')

# jpeg('../Images/pca.jpg')

Here we can see that our top 2 principal components, which explain roughly 61.9% of the variation in our data, are already pretty good representations for describing the different components in our dataset. Even so, the groupings are distinctly plotted on the 2-D graph.


Clustering

Because we are investigating the validity of mountain bike categories, one approach is to treat this dataset as unsupervised, stripping the bikes of their label and seeing if various clustering algorithms can re-create the 5 distinct labels.

K-Means

# How many clusters are necessary? 4

mtb_numeric <- mtb_no_null %>% 
  select(-label)
mtb_standard_scaled <- scale(mtb_numeric)

mtb_numeric <- mtb_no_null %>% 
  select(-label)

mtb_numeric <- mtb_no_null %>% 
  select(-label)

clusters <- 1:10
dists <- c()
for (c in 1:10){
  km <- kmeans(mtb_standard_scaled, centers=c, iter.max=1000)
  dists <- c(dists, km$tot.withinss)
}

# jpeg('../Images/Kmeans.jpg')
# plot(clusters, dists, type='l', xlab='Clusters', ylab='Total Sum of Squared Euclidean Distances')

# Plot our results
tibble(clusters = clusters,
       dists = dists) %>% 
  ggplot(aes(x = clusters, y = dists)) + 
  geom_point(size = 4, color = 'slateblue') +
  geom_line(alpha = .6, lwd = 2, color = 'slateblue') + 
  labs(title = "K-Means Clustering of MTB Data",
       subtitle = 'Method uses `tot.withinss` parameter to measure distances.',
       x = 'Clusters',
       y = 'Total Sum of Squared Euclidean Distances')

# Let's see where these clusters would end up on the 2D PCA plot
mtb_pca_scaled <- prcomp(mtb_standard_scaled,
                  center = F,
                  scale. = F)

pca_2_scaled <- as.matrix(mtb_standard_scaled) %*% as.matrix(mtb_pca_scaled$rotation[,1:2])

pca_km_scaled <- kmeans(pca_2_scaled, centers=3, iter.max=1000)


# Bring our PCA and k-means clusters results into our dataset
new_mtb_data %>% 
  bind_cols(pca_2_scaled) %>% 
  mutate(# Create a feature for the long-version of the names
         bike_category = case_when(
          label == 'tr' ~ 'Trail',
          label == 'xc' ~ 'Cross Country',
          label == 'dc' ~ 'Downcountry',
          label == 'am' ~ 'All Mountain',
          label == 'en' ~ 'Enduro',
          TRUE ~ 'red'
        ),
        # Bring our clusters in as a factor
        cluster = as.factor(pca_km_scaled$cluster)) %>% 
  # GG-plot our shit
  ggplot() +
    geom_point(aes(x = PC1, 
                   y = PC2, 
                   color = cluster, 
                   shape = bike_category), 
               alpha = .9, 
               size = 3) +
    # Add our cluster centers in as well
    geom_point(data = as_tibble(pca_km_scaled$centers) %>%
                 mutate(cluster = as.factor(c(1, 2, 3))), 
               aes(x = PC1, 
                   y = PC2,
                   color = cluster), 
               shape = 10, 
               size = 7) + 
  # Color clusters accordingly
    scale_color_manual(values = c('slateblue4', 'gray', 'slateblue1'), name = 'Cluster') +
    labs(title = "K-Means Clustering of MTB Principal Components",
       subtitle = 'Clusters denoted by color;\nBike categories denoted by shape;\nCluster centers denoted by large cross-hairs shape.',
       x = 'Principal Component 1',
       y = 'Principal Component 2')

#TODO let's look at this bottom cluster - both Niner bikes
# Niner has low reach numbers on its bikes - could be because we used the Medium for these!
# Based on PCA mapping, the blur tr, expic, Exie, Ripley, and Element all have less chainstay length, and less pistons?? wow, should we exclude piston count?? with more 2 piston bikes getting added, it evens out the average, so these aren't showing up as much anymore

Above, we attempted to graph the 3 clusters created for the top 2 principal components in our data. For example, we can see Cluster #1 on the left-hand side of the chart, mostly composed of Cross Country bikes (diamonds in the chart) and some Downcountry bikes (denoted by squares). Downcountry bikes also seem to be part of Cluster #2 (gray points), along with Trail bikes (denoted by squares with an ‘x’ in them). However, Trail bikes also feature heavily in Cluster #3 along with Enduro bikes (denoted by ‘+’).

Overall, it’s clear that there is significant overlap between our Clusters, mainly along the Principal Component 1 axis; lending credence to the notion that our bikes can be differentiated along a single, continuous scale.

Note: In the bottom-left of the graph (PC2 < -4), we see two Niner bikes, almost acting as outliers. Niner has low reach numbers on its bikes, which could be because we used the Medium size for these rather than the Large size!

Gaussian Mixture Model (GMM)

In this section, we’ll take a more probabilistic model to our clustering. That is, we’ll use a Guassian Mixture Model (GMM) to build out normally distributed subgroupings within our mountain bike dataset, where the densities of each of the subgroupings represents a probability that a bike belongs to that subgrouping. Unlike K-Means, which is a more centroid-based clustering method, GMM is more of a distribution-based clustering method.

Generally, what we expect to see is something like the following:

where, given a specific type of bike, we can predict the probability, \(p(x)\) that a bike belongs to a category like Cross Country (xc) vs. Trail vs. Enduro.

p_load(ClusterR)

# Build our GMM model
mtb_gmm <- GMM(mtb_standard_scaled,
               dist_mode = 'eucl_dist', # Distance metric to use during seeding of initial means clustering
               seed_mode = 'random_subset', # How initial means are seeded prior to EM alg
               km_iter = 10, # Num of iterations of K-Means alg
               em_iter = 10, # Num of iterations of EM alg
               verbose = T
               )
## gmm_diag::learn(): generating initial means
## gmm_diag::learn(): k-means: n_threads: 1
## gmm_diag::learn(): k-means: iteration:    1   delta: 5.7245
## gmm_diag::learn(): k-means: iteration:    2   delta: 6.42421e-34
## gmm_diag::learn(): generating initial covariances
## gmm_diag::learn(): EM: n_threads: 1
## gmm_diag::learn(): EM: iteration:    1   avg_log_p: -23.9741
## gmm_diag::learn(): EM: iteration:    2   avg_log_p: -23.9741
## 
## time to complete : 7.3667e-05
mtb_gmm_pred <- predict(mtb_gmm, mtb_standard_scaled)

opt_gmm <- Optimal_Clusters_GMM(mtb_standard_scaled, 
                               max_clusters = 20, 
                               criterion = "BIC", 
                               dist_mode = "eucl_dist", 
                               seed_mode = "random_subset",
                               km_iter = 10, 
                               em_iter = 10, 
                               var_floor = 1e-10, 
                               plot_data = T)

Use the mclust package in R, which utilizes Bayesian Information Criterion (BIC) to optimize the number of clusters.

p_load(mclust)

mtb_gmm2 <- Mclust(mtb_standard_scaled)

#or specify number of clusters
# mb3 = Mclust(iris[,-5], 3)

# optimal selected model
# mtb_gmm2$modelName

# optimal number of cluster
# mtb_gmm2$G

# probality for an observation to be in a given cluster
# head(mtb_gmm2)

# get probabilities, means, variances
summary(mtb_gmm2, parameters = TRUE)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust XXX (ellipsoidal multivariate normal) model with 1 component: 
## 
##  log-likelihood  n  df      BIC      ICL
##        1095.022 58 170 1499.768 1499.768
## 
## Clustering table:
##  1 
## 58 
## 
## Mixing probabilities:
## 1 
## 1 
## 
## Means:
##                                       [,1]
## rear_travel      -0.0000000000000010855331
## fork_travel       0.0000000000000005545634
## f_piston         -0.0000000000000010381606
## f_rotor_dim      -0.0000000000000007034374
## r_piston         -0.0000000000000010381606
## r_rotor_dim      -0.0000000000000041790108
## head_angle        0.0000000000000013484653
## seat_angle       -0.0000000000000023154629
## crank_length     -0.0000000000000018577082
## stem_length      -0.0000000000000001790994
## handlebar_width   0.0000000000000039486004
## reach             0.0000000000000013751813
## stack             0.0000000000000068490980
## wheelbase         0.0000000000000078192571
## chainstay_length -0.0000000000000365012607
## bb_height        -0.0000000000000083943779
## standover_height  0.0000000000000022679740
## 
## Variances:
## [,,1]
##                  rear_travel fork_travel    f_piston f_rotor_dim    r_piston
## rear_travel       0.98275862  0.86647130  0.53074705  0.72270044  0.53074705
## fork_travel       0.86647130  0.98275862  0.59701027  0.78438697  0.59701027
## f_piston          0.53074705  0.59701027  0.98275862  0.57760629  0.98275862
## f_rotor_dim       0.72270044  0.78438697  0.57760629  0.98275862  0.57760629
## r_piston          0.53074705  0.59701027  0.98275862  0.57760629  0.98275862
## r_rotor_dim       0.77073412  0.79437848  0.43519186  0.74245994  0.43519186
## head_angle       -0.81199232 -0.83055385 -0.44808970 -0.66046683 -0.44808970
## seat_angle        0.59460756  0.63732989  0.36523677  0.56513695  0.36523677
## crank_length     -0.05247594 -0.05329849 -0.01705486 -0.04492137 -0.01705486
## stem_length      -0.52751652 -0.54137826 -0.45331950 -0.37291853 -0.45331950
## handlebar_width   0.61297175  0.66689437  0.53800825  0.56677331  0.53800825
## reach             0.50878091  0.46844618  0.12052159  0.41797347  0.12052159
## stack             0.62986365  0.70353377  0.40627172  0.46370383  0.40627172
## wheelbase         0.81153584  0.80687536  0.36946161  0.65071880  0.36946161
## chainstay_length  0.21765491  0.24627225  0.23496345  0.33487904  0.23496345
## bb_height         0.60876155  0.66944392  0.45592502  0.52235140  0.45592502
## standover_height -0.17884464 -0.08927132 -0.28317481 -0.04107194 -0.28317481
##                  r_rotor_dim    head_angle  seat_angle  crank_length
## rear_travel       0.77073412 -0.8119923156  0.59460756 -0.0524759431
## fork_travel       0.79437848 -0.8305538489  0.63732989 -0.0532984850
## f_piston          0.43519186 -0.4480897037  0.36523677 -0.0170548600
## f_rotor_dim       0.74245994 -0.6604668277  0.56513695 -0.0449213682
## r_piston          0.43519186 -0.4480897037  0.36523677 -0.0170548600
## r_rotor_dim       0.98275862 -0.7644254772  0.63374974 -0.0175895936
## head_angle       -0.76442548  0.9827586207 -0.74721358 -0.0007695164
## seat_angle        0.63374974 -0.7472135761  0.98275862 -0.0512057171
## crank_length     -0.01758959 -0.0007695164 -0.05120572  0.9827586207
## stem_length      -0.39794105  0.5763723220 -0.52789789 -0.0783916344
## handlebar_width   0.56297068 -0.6713014498  0.52755131  0.0926088407
## reach             0.45588325 -0.5810910021  0.57203070 -0.0088706963
## stack             0.59439229 -0.6996862164  0.62242136  0.1205464243
## wheelbase         0.71092923 -0.8998065172  0.72976412 -0.0282440001
## chainstay_length  0.15814975 -0.2217584548  0.06649237 -0.0009586544
## bb_height         0.47593674 -0.5987405494  0.45187614  0.1300968576
## standover_height -0.11335279  0.1344210160 -0.28064197 -0.0271607065
##                  stem_length handlebar_width        reach      stack  wheelbase
## rear_travel      -0.52751652      0.61297175  0.508780910  0.6298637  0.8115358
## fork_travel      -0.54137826      0.66689437  0.468446179  0.7035338  0.8068754
## f_piston         -0.45331950      0.53800825  0.120521586  0.4062717  0.3694616
## f_rotor_dim      -0.37291853      0.56677331  0.417973470  0.4637038  0.6507188
## r_piston         -0.45331950      0.53800825  0.120521586  0.4062717  0.3694616
## r_rotor_dim      -0.39794105      0.56297068  0.455883252  0.5943923  0.7109292
## head_angle        0.57637232     -0.67130145 -0.581091002 -0.6996862 -0.8998065
## seat_angle       -0.52789789      0.52755131  0.572030701  0.6224214  0.7297641
## crank_length     -0.07839163      0.09260884 -0.008870696  0.1205464 -0.0282440
## stem_length       0.98275862     -0.59053686 -0.329975492 -0.6565999 -0.5781065
## handlebar_width  -0.59053686      0.98275862  0.322740937  0.6070998  0.6070346
## reach            -0.32997549      0.32274094  0.982758621  0.4927676  0.6938129
## stack            -0.65659992      0.60709976  0.492767604  0.9827586  0.7492409
## wheelbase        -0.57810653      0.60703461  0.693812924  0.7492409  0.9827586
## chainstay_length -0.10000081      0.34555985 -0.046796458  0.1998316  0.2450669
## bb_height        -0.53601700      0.69945043  0.313913983  0.5873574  0.5877167
## standover_height  0.30985818     -0.22425359 -0.112561218 -0.0762681 -0.0522580
##                  chainstay_length  bb_height standover_height
## rear_travel          0.2176549122  0.6087615      -0.17884464
## fork_travel          0.2462722471  0.6694439      -0.08927132
## f_piston             0.2349634516  0.4559250      -0.28317481
## f_rotor_dim          0.3348790372  0.5223514      -0.04107194
## r_piston             0.2349634516  0.4559250      -0.28317481
## r_rotor_dim          0.1581497490  0.4759367      -0.11335279
## head_angle          -0.2217584548 -0.5987405       0.13442102
## seat_angle           0.0664923694  0.4518761      -0.28064197
## crank_length        -0.0009586544  0.1300969      -0.02716071
## stem_length         -0.1000008062 -0.5360170       0.30985818
## handlebar_width      0.3455598514  0.6994504      -0.22425359
## reach               -0.0467964577  0.3139140      -0.11256122
## stack                0.1998316255  0.5873574      -0.07626810
## wheelbase            0.2450668561  0.5877167      -0.05225800
## chainstay_length     0.9827586207  0.3448505       0.29091329
## bb_height            0.3448504748  0.9827586      -0.20756940
## standover_height     0.2909132943 -0.2075694       0.98275862
plot(mtb_gmm2, 'classification')

Multi-class SVM

p_load(e1071,
       caret)

#convert all mountain category to enduro, dc -> Xc?
remap <- function(x, num){
  if (x=='am' || x=='en'){
    if (num){
      return(4)
    }
    else{
      return('Enduro')
    }
  }
  else if(x=='xc'){
    if(num){
      return(1)
    }
    else{
      return('Cross Country')
    }
  }
  else if(x=='dc'){
    if(num){
      return(2)
    }
    else{
      return('Downcountry')
    }
  }
  else if(x=='tr'){
    if(num){
      return(3)
    }
    else{
      return('Trail')
    }
  }
}
labels <- as.factor(unlist(lapply(new_mtb_data$label, remap, F)))
n <- length(labels)
test_idx <- sort(sample(1:n, round(n/5)))

Xtest <- mtb_standard_scaled[test_idx, ]
Xtrain <-  mtb_standard_scaled[-test_idx, ]

trainSVM <- function(x, y, idx){
  xtest <- x[idx,]
  xtrain <- x[-idx,]
  ytest <- y[idx]
  ytrain <- y[-idx]
  
  clf <- svm(x=xtrain, y=ytrain)
  preds <- predict(clf, xtest)
  
  acc <- 0
  cm <- table(ytest, preds)
  for (i in 1:length(unique(labels))){
    acc <- acc + cm[i,i]
  }
  return(acc/sum(cm))
}

folds <- createFolds(labels, k=10)

accs <- c()

for (fold in folds){
  acc <- trainSVM(mtb_standard_scaled, labels, fold)
  accs <- c(accs, acc)
}

mean(accs)
## [1] 0.587619
# Roughly 60% accuracy when treating down country as separate category
# But - roughly 77% accuracy when treating down country as XC, only 65% accuracy when treating downcountry as trail, suggests that downcountry bikes are more akin to XC than they are trail
folds <- createFolds(labels, k=10)

accs_2pc <- c()

for (fold in folds){
  acc <- trainSVM(pca_2_scaled, labels, fold)
  accs_2pc <- c(accs, acc)
}

mean(accs_2pc)
## [1] 0.5861472
dat <- data.frame(x=cbind(pca_2_scaled[,2], pca_2_scaled[,1]), y=labels)
pcsvm <- svm(y~., data=dat)

# jpeg('roughSVM.jpg')
plot(pcsvm, dat)

# circle -> correctly predicted
# X -> incorrectly predicted
# black -> true XC
# Red -> true DC
# Blue -> true TR
# Green -> true EN

Conclusions

Findings

  • All results suggest that trying to discretely categorize full suspension mountain bikes is more or less arbitrary.

  • The categorization of a mountain bike should be treated as a continuous scale, with Cross Country bikes on one end and Enduro bikes on another.

  • To obtain where a specific bike lies on this scale, one can use the linear combination of the bike’s specifications and the first principle component.

Opportunities for Improved Analysis

There are a few opportunities to improve the analysis included in this presentation and forthcoming report:

  • Inclusion of more bikes (rows) | More rows = more robust clustering algorithms.

  • Inclusion of more bike features (columns) | Although we included the most meaningful specs/geometry of the bikes analyzed, there are dozens of other, smaller features that can be used to help differentiate between different types of bikes.

  • Include all sizes of bikes | We chose to use the size that corresponded to a 5’10” rider, but some bike manufacturers could interpret this as a Medium and others a Large.